Take Home Exercise 3

Published

March 26, 2023

Modified

March 20, 2023

1 Setting The Scene

Housing is an essential component of household wealth worldwide. Buying a housing has always been a major investment for most people. The price of housing is affected by many factors. Some of them are global in nature such as the general economy of a country or inflation rate. Others can be more specific to the properties themselves. These factors can be further divided to structural and locational factors. Structural factors are variables related to the property themselves such as the size, fitting, and tenure of the property. Locational factors are variables related to the neighbourhood of the properties such as proximity to childcare centre, public transport service and shopping centre.

Conventional, housing resale prices predictive models were built by using Ordinary Least Square (OLS) method. However, this method failed to take into consideration that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. With the existence of spatial autocorrelation, the OLS estimation of predictive housing resale pricing models could lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, Geographical Weighted Models were introduced for calibrating predictive model for housing resale prices.

2 The Task

In this take-home exercise, you are tasked to predict HDB resale prices at the sub-market level (i.e. HDB 3-room, HDB 4-room and HDB 5-room) for the month of January and February 2023 in Singapore. The predictive models must be built by using by using conventional OLS method and GWR methods. You are also required to compare the performance of the conventional OLS method versus the geographical weighted methods.

3 The Data

Structural and Locational factors will be extracted:

  • Structural factors

    • Area of the unit

    • Floor level

    • Remaining lease

    • Age of the unit

  • Locational factors

    • Proxomity to CBD

    • Proximity to eldercare

    • Proximity to hawker centres

    • Proximity to MRT

    • Proximity to park

    • Proximity to good primary school

    • Proximity to shopping mall

    • Proximity to supermarket

    • Number of kindergartens within 350m

    • Number of childcare centres within 350m

    • Number of bus stop within 350m

    • Number of primary schools within 1km

The following table describes the data sources used in this assignment.

datasets <- data.frame(
  Type=c("Aspatial",
         "Geospatial",
         "Geospatial - Extracted",
         "Geospatial",
         "Geospatial",
         "Geospatial",
         "Geospatial",
         "Geospatial - Extracted",
         "Geospatial - Extracted",
         "Geospatial - Extracted",
         "Geospatial - Extracted",
         "Geospatial",
         "Geospatial",
         "Geospatial"
         ),
  
  Name=c("Resale Flat Prices",
         "Master Plan 2019 Subzone Boundary (Web)",
         "Central Business District Coordinates",
         "Eldercare Services",
         "Hawker Centres",
         "MRT Stations",
         "Parks",
         "Good Primary Schools",
         "Shopping Mall",
         "Supermarkets",
         "Kindergartens",
         "Childcare Services",
         "Bus Stops",
         "Primary Schools"),
  
  Format=c(".csv", 
           ".shp", 
           ".shp", 
           ".shp", 
           ".geojson",
           ".shp", 
           ".kml", 
           ".shp", 
           ".geojson",
           ".geojson", 
           ".shp",
           ".geojson",
           ".shp",
           ".xlsx"
           ),
  
  Source=c("[data.gov.sg](https://data.gov.sg/dataset/resale-flat-prices)",
           "[In-Class Ex 9](https://data.gov.sg/dataset/master-plan-2019-subzone-boundary-no-sea)",
           "[latlong.net](https://www.latlong.net/place/downtown-core-singapore-20616.html)",
           "[data.gov.sg](https://data.gov.sg/dataset/eldercare-services)",
           "[data.gov.sg](https://data.gov.sg/dataset/hawker-centres)",
           "[Datamall](https://datamall.lta.gov.sg/content/dam/datamall/datasets/Geospatial/TrainStationExit.zip)",
           "[data.gov.sg](https://data.gov.sg/dataset/parks)",
           
           "[salary.sg](https://www.salary.sg/2022/best-primary-schools-2022-by-popularity/)",
           "[Mall SVY21 Coordinates Web Scaper](https://github.com/ValaryLim/Mall-Coordinates-Web-Scraper)",
           "[data.gov.sg](https://data.gov.sg/dataset/supermarkets)",
           "[OneMap API](https://www.onemap.gov.sg/docs/)",
           "[Hands-on Ex 3](https://annatrw-is415.netlify.app/hands-on_ex/hands-on_ex03/hands-on_ex03)",
           "[Datamall](https://datamall.lta.gov.sg/content/dam/datamall/datasets/Geospatial/BusStopLocation.zip)",
           "[data.gov.sg](https://data.gov.sg/dataset/school-directory-and-information)"
           )
  )

# with reference to this guide on kableExtra:
# https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
# kable_material is the name of the kable theme
# 'hover' for to highlight row when hovering, 'scale_down' to adjust table to fit page width
library(knitr)
library(kableExtra)
kable(datasets, caption="Datasets Used") %>%
  kable_material("hover", latex_options="scale_down")
Datasets Used
Type Name Format Source
Aspatial Resale Flat Prices .csv [data.gov.sg](https://data.gov.sg/dataset/resale-flat-prices)
Geospatial Master Plan 2019 Subzone Boundary (Web) .shp [In-Class Ex 9](https://data.gov.sg/dataset/master-plan-2019-subzone-boundary-no-sea)
Geospatial - Extracted Central Business District Coordinates .shp [latlong.net](https://www.latlong.net/place/downtown-core-singapore-20616.html)
Geospatial Eldercare Services .shp [data.gov.sg](https://data.gov.sg/dataset/eldercare-services)
Geospatial Hawker Centres .geojson [data.gov.sg](https://data.gov.sg/dataset/hawker-centres)
Geospatial MRT Stations .shp [Datamall](https://datamall.lta.gov.sg/content/dam/datamall/datasets/Geospatial/TrainStationExit.zip)
Geospatial Parks .kml [data.gov.sg](https://data.gov.sg/dataset/parks)
Geospatial - Extracted Good Primary Schools .shp [salary.sg](https://www.salary.sg/2022/best-primary-schools-2022-by-popularity/)
Geospatial - Extracted Shopping Mall .geojson [Mall SVY21 Coordinates Web Scaper](https://github.com/ValaryLim/Mall-Coordinates-Web-Scraper)
Geospatial - Extracted Supermarkets .geojson [data.gov.sg](https://data.gov.sg/dataset/supermarkets)
Geospatial - Extracted Kindergartens .shp [OneMap API](https://www.onemap.gov.sg/docs/)
Geospatial Childcare Services .geojson [Hands-on Ex 3](https://annatrw-is415.netlify.app/hands-on_ex/hands-on_ex03/hands-on_ex03)
Geospatial Bus Stops .shp [Datamall](https://datamall.lta.gov.sg/content/dam/datamall/datasets/Geospatial/BusStopLocation.zip)
Geospatial Primary Schools .xlsx [data.gov.sg](https://data.gov.sg/dataset/school-directory-and-information)

4 Importing Required R Packages

GWmodel

pacman::p_load(tidygeocoder, sf, tidyverse, tmap, jsonlite, rvest, onemapsgapi, olsrr, corrplot, ggpubr, spdep, GWmodel, gtsummary, ggthemes, zoo, SpatialML, tmap, rsample, Metrics)

5 Structural Factors

5.1 Import HDB Resale Prices Data

Using read_csv(), import the HDB resale price data.

resale_all <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

This is how the resale data looks like.

glimpse(resale_all)

Filter out 2021 and 2022 data as per assignment requirement using grepl().

resale2122 <- filter(resale_all, grepl('2021', month)|grepl('2022', month))

Filter out 2023 data for testing.

resale23_test <- filter(resale_all, grepl('2023', month))

For the purposes of this assignment, we will be focusing on 5 Room HDB resale flat prices.

unique(resale2122$flat_type)
resale2122_5r <- filter(resale2122, flat_type == '5 ROOM')
resale23_test <- filter(resale23_test, flat_type == '5 ROOM')

5.2 Geocode Resale Data

Referencing senior’s work, this geocoding function using onemap API will retrieve the latitude and longitude for resale data.

library(httr)
geocode_function <- function(block, street_name) {
  base_url <- "https://developers.onemap.sg/commonapi/search"
  address <- paste(block, street_name, sep = " ")
  query <- list("searchVal" = address, 
                "returnGeom" = "Y",
                "getAddrDetails" = "N",
                "pageNum" = "1")
  
  res <- GET(base_url, query = query)
  restext<-content(res, as="text")
  
  output <- fromJSON(restext)  %>% 
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)

  return(output)
}
resale2122_5r$street_name <- gsub("ST\\.", "SAINT", resale2122_5r$street_name)

Execute the geocoding function

resale2122_5r$LATITUDE <- 0
resale2122_5r$LONGITUDE <- 0

for (i in 1:nrow(resale2122_5r)){
  temp_output <- geocode_function(resale2122_5r[i, 4], resale2122_5r[i, 5])
  
  resale2122_5r$LATITUDE[i] <- temp_output$results.LATITUDE
  resale2122_5r$LONGITUDE[i] <- temp_output$results.LONGITUDE
}

We check if there are null values in the coordinate columns, in which there are none.

sum(is.na(resale2122_5r$LATITUDE))
sum(is.na(resale2122_5r$LONGITUDE))
resale23_test$street_name <- gsub("ST\\.", "SAINT", resale23_test$street_name)

Execute the geocoding function

resale23_test$LATITUDE <- 0
resale23_test$LONGITUDE <- 0

for (i in 1:nrow(resale23_test)){
  temp_output <- geocode_function(resale23_test[i, 4], resale23_test[i, 5])
  
  resale23_test$LATITUDE[i] <- temp_output$results.LATITUDE
  resale23_test$LONGITUDE[i] <- temp_output$results.LONGITUDE
}
sum(is.na(resale23_test$LATITUDE))
sum(is.na(resale23_test$LONGITUDE))

5.3 Remaining Lease

Referencing senior’s work, the remaining lease in years is retrieved by converting years and months into years.

str_list <- str_split(resale2122_5r$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      resale2122_5r$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    resale2122_5r$remaining_lease[i] <- year
  }
}
str_list <- str_split(resale23_test$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      resale23_test$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    resale23_test$remaining_lease[i] <- year
  }
}

5.4 Floor Level

Referencing senior’s work, the floor level is retrieved by first sorting the storeys in ascending order, then assigning a numerical value to each category.

storeys <- sort(unique(resale2122_5r$storey_range))
storey_order <- 1:length(storeys)
storey_range_order <- data.frame(storeys, storey_order)
resale2122_5r <- left_join(resale2122_5r, storey_range_order, by= c("storey_range" = "storeys"))
storeys <- sort(unique(resale23_test$storey_range))
storey_order <- 1:length(storeys)
storey_range_order <- data.frame(storeys, storey_order)
resale23_test <- left_join(resale23_test, storey_range_order, by= c("storey_range" = "storeys"))

5.5 Age of Unit

Convert the date field to a date data type, which assumes the first day of the month since date is in yyyy/mm format; referencing sources online.

date <- as.Date(as.yearmon(resale2122_5r$month))
resale2122_5r$start <- as.numeric(format(date,'%Y'))

Age of unit calculation will only consider the year when the flat was sold.

resale2122_5r$UNIT_AGE <- resale2122_5r$start - resale2122_5r$lease_commence_date
date <- as.Date(as.yearmon(resale23_test$month))
resale23_test$start <- as.numeric(format(date,'%Y'))
resale23_test$UNIT_AGE <- resale23_test$start - resale23_test$lease_commence_date

5.6 Convert Resale Data To sf Object

Convert the dataframe into an sf object with geometry field attached, referencing here.

resale2122_5r_sf <- st_as_sf(resale2122_5r, 
                      coords = c("LONGITUDE", 
                                 "LATITUDE"), 
                      # the geographical features are in longitude & latitude, in decimals
                      # as such, WGS84 is the most appropriate coordinates system
                      crs=4326) %>%
  #afterwards, we transform it to SVY21, our desired CRS
  st_transform(crs = 3414)
resale23_test_sf <- st_as_sf(resale23_test, 
                      coords = c("LONGITUDE", 
                                 "LATITUDE"), 
                      # the geographical features are in longitude & latitude, in decimals
                      # as such, WGS84 is the most appropriate coordinates system
                      crs=4326) %>%
  #afterwards, we transform it to SVY21, our desired CRS
  st_transform(crs = 3414)

6 Locational Factors

As mentioned, the following locational factors will be used for predictive modelling of HDB resale prices.

  • Proximity to CBD
  • Proximity to eldercare services
  • Proximity to hawker centres
  • Proximity to MRT stations
  • Proximity to parks
  • Proximity to good primary schools
  • proximity to shopping malls
  • proximity to supermarkets
  • Number of kindergarten within 350m
  • Number of child care services within 350m
  • Number of bus stops within 350m
  • Number of primary schools within 1km

6.1 Factors With Geographic Coordinates

6.1.1 Self-sourced and referenced

6.1.1.1 Kindergartens

Using OneMap API retrieval methods, my personal token was pre-loaded into the token variable and used to call the OneMap API to retrieve location of kindergartens in Singapore as an sf object.

search_themes(token, "kindergarten")
get_theme_status(token, "kindergartens")
themetibble <- get_theme(token, "kindergartens")
kindergarten_sf <- st_as_sf(themetibble, coords=c("Lng", "Lat"), crs=4326)

6.1.1.2 Shopping Malls

With reference to a previously done project to scrape the locations of shopping malls in Singapore, read in the csv and convert it to an sf object with appropriate projection system.

mall_csv <- read_csv("data/geospatial/mall_coordinates_updated.csv")
glimpse(mall_csv)
malls_sf <- st_as_sf(mall_csv,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

6.1.2 Geojson Sources

Read in the geojson files.

elder_sf <- st_read(dsn = "data/geospatial/ElderCare", layer="ELDERCARE")
mrt_sf <- st_read(dsn = "data/geospatial/TrainStationExit", layer="Train_Station_Exit_Layer")
bus_sf <- st_read(dsn = "data/geospatial/BusStop_Feb2023", layer="BusStop")

hawker_sf <- st_read("data/geospatial/hawker-centres-geojson.geojson") 
parks_sf <- st_read("data/geospatial/parks.kml") 
supermkt_sf <- st_read("data/geospatial/supermarkets-geojson.geojson") 
childcare_sf <- st_read("data/geospatial/child-care-services-geojson.geojson") 

mpsz_sf <- st_read(dsn="data/geospatial/MPSZ2019", layer = "MPSZ-2019") 
# kiv might delete/ write rds/ eval false
mpsz_sf <- st_read(dsn="data/geospatial/MPSZ2019", layer = "MPSZ-2019")
Reading layer `MPSZ-2019' from data source 
  `C:\annatrw\IS415\Take-home_Ex\Take-home_Ex03\data\geospatial\MPSZ2019' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
mpsz_sf <- st_set_crs(mpsz_sf, 3414)
mpsz_sf <- sf::st_make_valid(mpsz_sf)
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 0

6.1.2.1 CRS

Check the CRS for the layers.

st_crs(elder_sf)
st_crs(mrt_sf)
st_crs(bus_sf)
st_crs(hawker_sf)
st_crs(parks_sf)
st_crs(supermkt_sf)
st_crs(childcare_sf)
st_crs(mpsz_sf)
st_crs(kindergarten_sf)
st_crs(malls_sf)

They are projected wrongly - using EPSG 9001 - when it should be EPSG 4326 instead.

Set correct CRS except for malls_sf which is already EPSG3414 done in previous section when transforming to sf object.

mpsz_sf <- st_set_crs(mpsz_sf, 3414)
elder_sf <- st_set_crs(elder_sf, 3414)
mrt_sf <- st_set_crs(mrt_sf, 3414)
bus_sf <- st_set_crs(bus_sf, 3414)

hawker_sf <- hawker_sf %>%
  st_transform(crs = 3414)
parks_sf <- parks_sf %>%
  st_transform(crs = 3414)
supermkt_sf <- supermkt_sf %>%
  st_transform(crs = 3414)
childcare_sf <- childcare_sf %>%
  st_transform(crs = 3414)
kindergarten_sf <- kindergarten_sf %>%
  st_transform(crs = 3414)

Verify that there are no invalid geometries:

length(which(st_is_valid(mpsz_sf) == FALSE))

length(which(st_is_valid(elder_sf) == FALSE))
length(which(st_is_valid(mrt_sf) == FALSE))
length(which(st_is_valid(bus_sf) == FALSE))
length(which(st_is_valid(hawker_sf) == FALSE))
length(which(st_is_valid(supermkt_sf) == FALSE))
length(which(st_is_valid(parks_sf) == FALSE))
length(which(st_is_valid(childcare_sf) == FALSE))
length(which(st_is_valid(kindergarten_sf) == FALSE))
length(which(st_is_valid(malls_sf) == FALSE))

However, there are invalid geometries for mpsz_sf; which can be easily rectified with st_make_valid().

mpsz_sf <- sf::st_make_valid(mpsz_sf)
length(which(st_is_valid(mpsz_sf) == FALSE))

6.1.3 Calculate Proximity

Using a proximity function from senior’s work, st_distance is used to compute a distance matrix which will retrieve the minimum distance from origin to destination.

get_prox <- function(origin_df, dest_df, col_name){
  
  # creates a matrix of distances
  dist_matrix <- st_distance(origin_df, dest_df)           
  
  # find the nearest location_factor and create new data frame
  near <- origin_df %>% 
    mutate(PROX = apply(dist_matrix, 1, function(x) min(x)) / 1000) 
  
  # rename column name according to input parameter
  names(near)[names(near) == 'PROX'] <- col_name

  # Return df
  return(near)
}
resale2122_5r_sf <- get_prox(resale2122_5r_sf, elder_sf, "PROX_ELDERCARE") 
resale2122_5r_sf <- get_prox(resale2122_5r_sf, mrt_sf, "PROX_MRT")
resale2122_5r_sf <- get_prox(resale2122_5r_sf, hawker_sf, "PROX_HAWKER") 
resale2122_5r_sf <- get_prox(resale2122_5r_sf, parks_sf, "PROX_PARK") 
resale2122_5r_sf <- get_prox(resale2122_5r_sf, supermkt_sf, "PROX_SUPERMARKET")
resale2122_5r_sf <- get_prox(resale2122_5r_sf, malls_sf, "PROX_MALL")
resale23_test_sf <- get_prox(resale23_test_sf, elder_sf, "PROX_ELDERCARE") 
resale23_test_sf <- get_prox(resale23_test_sf, mrt_sf, "PROX_MRT")
resale23_test_sf <- get_prox(resale23_test_sf, hawker_sf, "PROX_HAWKER") 
resale23_test_sf <- get_prox(resale23_test_sf, parks_sf, "PROX_PARK") 
resale23_test_sf <- get_prox(resale23_test_sf, supermkt_sf, "PROX_SUPERMARKET")
resale23_test_sf <- get_prox(resale23_test_sf, malls_sf, "PROX_MALL")

6.2 Calculate Number of Amenities

Using a function to get the number of amenities within a certain distance from senior’s work, the get within function counts the number of amenities within the threshold distance.

get_within <- function(origin_df, dest_df, threshold_dist, col_name){
  
  # creates a matrix of distances
  dist_matrix <- st_distance(origin_df, dest_df)   
  
  # count the number of location_factors within threshold_dist and create new data frame
  wdist <- origin_df %>% 
    mutate(WITHIN_DT = apply(dist_matrix, 1, function(x) sum(x <= threshold_dist)))
  
  # rename column name according to input parameter
  names(wdist)[names(wdist) == 'WITHIN_DT'] <- col_name

  # Return df
  return(wdist)
}

6.2.0.1 Number of Kindergartens Within 350m

resale2122_5r_sf <- get_within(resale2122_5r_sf, kindergarten_sf, 350, "WITHIN_350M_KINDERGARTEN")

6.2.0.2 Number of Childcare Centres Within 350m

resale2122_5r_sf <- get_within(resale2122_5r_sf, childcare_sf, 350, "WITHIN_350M_CHILDCARE")

6.2.0.3 Number of Bus Stops Within 350m

resale2122_5r_sf <- get_within(resale2122_5r_sf, bus_sf, 350, "WITHIN_350M_BUS")

6.2.0.4 Number of Kindergartens Within 350m

resale23_test_sf <- get_within(resale23_test_sf, kindergarten_sf, 350, "WITHIN_350M_KINDERGARTEN")

6.2.0.5 Number of Childcare Centres Within 350m

resale23_test_sf <- get_within(resale23_test_sf, childcare_sf, 350, "WITHIN_350M_CHILDCARE")

6.2.0.6 Number of Bus Stops Within 350m

resale23_test_sf <- get_within(resale23_test_sf, bus_sf, 350, "WITHIN_350M_BUS")

6.3 Factors Without Geographic Coordinates

6.3.1 CBD

Get the coordinates of Singapore’s Central Business District in Downtown Core using latlong.net: 1.287953, 103.851784

name <- c('CBD')
latitude= c(1.287953)
longitude= c(103.851784)
cbd <- data.frame(name, latitude, longitude)
cbd_sf <- st_as_sf(cbd, coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)
st_crs(cbd_sf)

Upon checking, the CBD data is in the correct CRS.

6.3.1.1 Get Proximity To CBD

This is done by calling the earlier defined get_prox function.

resale2122_5r_sf <- get_prox(resale2122_5r_sf, cbd_sf, "PROX_CBD") 
resale23_test_sf <- get_prox(resale23_test_sf, cbd_sf, "PROX_CBD") 

6.3.2 Primary Schools

Read the csv of schools and filter out the primary schools for study.

pri_sch <- read_csv("data/geospatial/general-information-of-schools.csv")
pri_sch <- pri_sch %>%
  filter(mainlevel_code == "PRIMARY") %>%
  select(school_name, address, postal_code, mainlevel_code)
glimpse(pri_sch)

6.3.2.1 Geocode

Referencing this link, define a get_coords() function that utilises the OneMap API, feeding the postal codes through it to get latitude and longitude.

get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://developers.onemap.sg/commonapi/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

We extract the unique postal codes and feed them through the get_coords function.

prisch_postal <- sort(unique(pri_sch$postal_code))
prisch_coords <- get_coords(prisch_postal)

This code checks for any null values in the retrieved coordinates data.

prisch_coords[(is.na(prisch_coords$postal) | is.na(prisch_coords$latitude) | is.na(prisch_coords$longitude)), ]
glimpse(prisch_coords)

We can safely combine the retrieved coordinates with the original primary schools csv data.

prisch_coords = prisch_coords[c("postal","latitude", "longitude")]
pri_sch <- left_join(pri_sch, prisch_coords, by = c('postal_code' = 'postal'))

Lastly, we convert to it to an sf object with the correct projection.

prisch_sf <- st_as_sf(pri_sch,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

6.3.2.2 Number of Primary Schools Within 1km

Run the primary schools data through the get_within function to get the number of primary schools within 1km.

resale2122_5r_sf <- get_within(resale2122_5r_sf, prisch_sf, 1000, "WITHIN_1KM_PRISCH")
resale23_test_sf <- get_within(resale23_test_sf, prisch_sf, 1000, "WITHIN_1KM_PRISCH")

6.4 Good Primary Schools

With reference to salary.sg, we can extract the top 10 primary schools in 2022. Using the method employed by senior, the following web crawling method is used to search for the html element - which is a list item - on the web page to retrieve the top 10 primary schools.

url <- "https://www.salary.sg/2022/best-primary-schools-2022-by-popularity/"

good_pri <- data.frame()

schools <- read_html(url) %>%
  html_nodes(xpath = paste('//*[@id="post-33132"]/div[3]/div/div/ol/li') ) %>%
  html_text() 

for (i in (schools)){
  sch_name <- toupper(gsub(" – .*","",i))
  sch_name <- gsub("\\(PRIMARY SECTION)","",sch_name)
  sch_name <- trimws(sch_name)
  new_row <- data.frame(pri_sch_name=sch_name)
  # Add the row
  good_pri <- rbind(good_pri, new_row)
}

top_good_pri <- head(good_pri, 10)
top_good_pri

Check which top schools are not in prisch_sf.

top_good_pri$pri_sch_name[!top_good_pri$pri_sch_name %in% prisch_sf$school_name]

Get the coordinates list of good primary schools and check for null values, upon which we see that St Hilda’s primary school does not exist in the prisch_sf data.

good_pri_list <- unique(top_good_pri$pri_sch_name)
good_pri_list
goodprisch_coords <- get_coords(good_pri_list)
goodprisch_coords
goodprisch_coords[(is.na(goodprisch_coords$postal) | is.na(goodprisch_coords$latitude) | is.na(goodprisch_coords$longitude)), ]

To rectify this, change the apostrophe symbol for the school title “ST HILDA’S PRIMARY SCHOOL”.

top_good_pri$pri_sch_name[top_good_pri$pri_sch_name == "ST. HILDA’S PRIMARY SCHOOL"] <- "ST. HILDA'S PRIMARY SCHOOL"

Here, we check once again for null values in the new coordinates list.

good_pri_list <- unique(top_good_pri$pri_sch_name)
good_pri_list
goodprisch_coords <- get_coords(good_pri_list)
goodprisch_coords
goodprisch_coords[(is.na(goodprisch_coords$postal) | is.na(goodprisch_coords$latitude) | is.na(goodprisch_coords$longitude)), ]

Finally, convert the layer into an sf object with the correct projection.

good_pri_sf <- st_as_sf(goodprisch_coords,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)
resale2122_5r_sf <- get_prox(resale2122_5r_sf, good_pri_sf, "PROX_GOOD_PRISCH")
resale23_test_sf <- get_prox(resale23_test_sf, good_pri_sf, "PROX_GOOD_PRISCH")

6.5 Other Data Processing

For the last steps of data processing, we retain necessary columns by dropping unnecessary one using subset() as well as convert the remaining lease field from a string data type to numeric.

resale2122_5r_sf <- subset(resale2122_5r_sf, select = -c(month, town, flat_type, storey_range, block, street_name, flat_model, lease_commence_date, start)) %>% select(resale_price, everything())
resale2122_5r_sf$remaining_lease <- as.numeric(resale2122_5r_sf$remaining_lease)
glimpse(resale2122_5r_sf)
resale23_test_sf <- subset(resale23_test_sf, select = -c(month, town, flat_type, storey_range, block, street_name, flat_model, lease_commence_date, start)) %>% select(resale_price, everything())
resale23_test_sf$remaining_lease <- as.numeric(resale23_test_sf$remaining_lease)

6.6 Write Data to rds File

train <- write_rds(resale2122_5r_sf, "data/rds/factors.rds")
test <- write_rds(resale23_test_sf, "data/rds/factors_test.rds")

7 EDA

train <- read_rds("data/rds/factors.rds")
test <- read_rds("data/rds/factors_test.rds")
glimpse(train)
Rows: 14,519
Columns: 18
$ resale_price             <dbl> 483000, 590000, 629000, 670000, 680000, 76000…
$ floor_area_sqm           <dbl> 118, 123, 118, 128, 133, 133, 110, 110, 110, …
$ remaining_lease          <dbl> 59.08, 55.58, 58.67, 74.25, 71.25, 74.50, 84.…
$ storey_order             <int> 1, 5, 6, 3, 1, 5, 8, 5, 6, 6, 5, 5, 8, 1, 2, …
$ UNIT_AGE                 <dbl> 40, 44, 41, 25, 28, 25, 15, 19, 15, 10, 10, 1…
$ geometry                 <POINT [m]> POINT (30820.82 39547.58), POINT (29412…
$ PROX_ELDERCARE           <dbl> 1.0646617, 0.1908834, 0.7891907, 0.1476040, 0…
$ PROX_MRT                 <dbl> 1.0808607, 0.5243923, 0.4159309, 0.2427019, 0…
$ PROX_HAWKER              <dbl> 0.4828156, 0.3317637, 0.3792242, 0.5884497, 0…
$ PROX_PARK                <dbl> 0.7359373, 0.5808933, 0.3087999, 0.2838337, 0…
$ PROX_SUPERMARKET         <dbl> 0.41991387, 0.24554343, 0.31805791, 0.3135757…
$ PROX_MALL                <dbl> 1.2132871, 0.4416229, 0.5494572, 1.5369839, 0…
$ WITHIN_350M_KINDERGARTEN <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, …
$ WITHIN_350M_CHILDCARE    <int> 2, 5, 3, 3, 2, 3, 5, 2, 5, 4, 5, 4, 4, 2, 2, …
$ WITHIN_350M_BUS          <int> 2, 8, 6, 11, 6, 8, 4, 9, 5, 9, 7, 9, 9, 7, 6,…
$ PROX_CBD                 <dbl> 9.537543, 8.663223, 9.449166, 9.211988, 8.935…
$ WITHIN_1KM_PRISCH        <int> 1, 3, 2, 3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 2, …
$ PROX_GOOD_PRISCH         <dbl> 2.6201989, 1.2626796, 1.8786592, 0.4463958, 1…
summary(train)
  resale_price     floor_area_sqm  remaining_lease  storey_order   
 Min.   : 350000   Min.   : 99.0   Min.   :49.08   Min.   : 1.000  
 1st Qu.: 528000   1st Qu.:112.0   1st Qu.:69.92   1st Qu.: 2.000  
 Median : 595000   Median :115.0   Median :77.83   Median : 3.000  
 Mean   : 627297   Mean   :117.4   Mean   :77.61   Mean   : 3.514  
 3rd Qu.: 690000   3rd Qu.:122.0   3rd Qu.:89.08   3rd Qu.: 5.000  
 Max.   :1418000   Max.   :167.0   Max.   :96.75   Max.   :17.000  
    UNIT_AGE              geometry     PROX_ELDERCARE      PROX_MRT      
 Min.   : 2.00   POINT        :14519   Min.   :0.0000   Min.   :0.02293  
 1st Qu.:10.00   epsg:3414    :    0   1st Qu.:0.3561   1st Qu.:0.25435  
 Median :21.00   +proj=tmer...:    0   Median :0.6375   Median :0.43919  
 Mean   :21.45                         Mean   :0.8264   Mean   :0.53943  
 3rd Qu.:29.00                         3rd Qu.:1.1393   3rd Qu.:0.72288  
 Max.   :50.00                         Max.   :8.2660   Max.   :2.12909  
  PROX_HAWKER        PROX_PARK       PROX_SUPERMARKET      PROX_MALL     
 Min.   :0.04949   Min.   :0.06004   Min.   :0.0000007   Min.   :0.0000  
 1st Qu.:0.44137   1st Qu.:0.48334   1st Qu.:0.1808138   1st Qu.:0.3669  
 Median :0.76583   Median :0.71014   Median :0.2784712   Median :0.5474  
 Mean   :0.88310   Mean   :0.79447   Mean   :0.2984345   Mean   :0.6301  
 3rd Qu.:1.17913   3rd Qu.:1.01433   3rd Qu.:0.3934945   3rd Qu.:0.8079  
 Max.   :6.63383   Max.   :6.00302   Max.   :1.6713100   Max.   :4.0092  
 WITHIN_350M_KINDERGARTEN WITHIN_350M_CHILDCARE WITHIN_350M_BUS 
 Min.   :0.000            Min.   : 0.000        Min.   : 0.000  
 1st Qu.:0.000            1st Qu.: 3.000        1st Qu.: 6.000  
 Median :1.000            Median : 4.000        Median : 8.000  
 Mean   :1.075            Mean   : 3.952        Mean   : 8.141  
 3rd Qu.:1.000            3rd Qu.: 5.000        3rd Qu.:10.000  
 Max.   :8.000            Max.   :20.000        Max.   :19.000  
    PROX_CBD      WITHIN_1KM_PRISCH PROX_GOOD_PRISCH  
 Min.   : 1.611   Min.   :0.000     Min.   : 0.07958  
 1st Qu.:10.637   1st Qu.:2.000     1st Qu.: 2.05408  
 Median :13.509   Median :3.000     Median : 3.17117  
 Mean   :12.704   Mean   :3.419     Mean   : 3.37962  
 3rd Qu.:15.250   3rd Qu.:4.000     3rd Qu.: 4.60464  
 Max.   :23.277   Max.   :9.000     Max.   :13.01618  

7.1 EDA with Statistical Graphs

summary(train$resale_price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 350000  528000  595000  627297  690000 1418000 
ggplot(data=train, aes(x=`resale_price`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

We can observe a right skewed distribution of HDB resale prices from the above histogram plot.

To visualise the shape of independent variables, use a multiple histogram plot below.

FLOOR_AREA_SQM <- ggplot(data=train, aes(x= `floor_area_sqm`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

LEASE_YRS <- ggplot(data=train, aes(x= `remaining_lease`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

STOREY_ORDER <- ggplot(data=train, aes(x= `storey_order`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

UNIT_AGE <- ggplot(data=train, aes(x= `UNIT_AGE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_ELDERCARE <- ggplot(data=train, aes(x= `PROX_ELDERCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MRT <- ggplot(data=train, aes(x= `PROX_MRT`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_HAWKER <- ggplot(data=train, aes(x= `PROX_HAWKER`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_PARK <- ggplot(data=train, aes(x= `PROX_PARK`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_SPM <- ggplot(data=train, aes(x= `PROX_SUPERMARKET`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MALL <- ggplot(data=train, aes(x= `PROX_MALL`)) +
  geom_histogram(bins=20, color="black", fill="light blue")


PROX_CBD <- ggplot(data=train, aes(x= `PROX_CBD`)) +
  geom_histogram(bins=20, color="black", fill="light blue")


PROX_TOP_SCH <- ggplot(data=train, 
                               aes(x= `PROX_GOOD_PRISCH`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

ggarrange(FLOOR_AREA_SQM, LEASE_YRS, STOREY_ORDER, UNIT_AGE, PROX_ELDERCARE, PROX_MRT, PROX_HAWKER, PROX_PARK, PROX_SPM, PROX_MALL, PROX_CBD, PROX_TOP_SCH,  
          ncol = 2, nrow = 6 )

kiv- floor area, unit age, remaining lease, a bit of prox to cbd binomial distribution with 2 peaks?

NUM_KG <- ggplot(data=train, aes(x= `WITHIN_350M_KINDERGARTEN`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

NUM_CC <- ggplot(data=train, aes(x= `WITHIN_350M_CHILDCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

NUM_BUS <- ggplot(data=train, aes(x= `WITHIN_350M_BUS`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

NUM_PRISCH <- ggplot(data=train, aes(x= `WITHIN_1KM_PRISCH`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

ggarrange(NUM_KG, NUM_CC, NUM_BUS, NUM_PRISCH, ncol= 2, nrow= 2)

Interpretation:

7.2 Statistical Point Map

tmap_mode("view")
tmap_options(check.and.fix = TRUE)

[kiv the zoom limits, position map properly]

tm_shape(mpsz_sf)+
  tm_polygons() +
tm_shape(train) +  
  tm_dots(col = "resale_price",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11, 15))
tmap_mode("plot")

we can observe that the HDB resale prices are high in the CBD, southerns and central areas of singapore 720k to 1.4mil west and northern districts have cheaper resale flats 350 to 625k

8 Hedonic Pricing Model

kiv explanatory vs predictive model: https://towardsdatascience.com/i-built-a-predictive-model-is-b-s-a905077fe4#:~:text=Explanatory%20models%20are%20retrospective.,both%20bias%20and%20estimation%20variance.

drop geometry to plot correlation matrix to check multi-colinearity

train_nogeom <- train %>% st_drop_geometry()

corrplot(cor(train_nogeom[, 2:17]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

Interpretation: The correlation matrix above shows that all the correlation values are below 0.8. Hence, there is no sign of multicolinearity. / Matrix reorder is very important for mining the hiden structure and patter in the matrix. There are four methods in corrplot (parameter order), named “AOE”, “FPC”, “hclust”, “alphabet”. In the code chunk above, AOE order is used. It orders the variables by using the angular order of the eigenvectors method suggested by Michael Friendly.

From the scatterplot matrix, it is clear that Freehold is highly correlated to LEASE_99YEAR. In view of this, it is wiser to only include either one of them in the subsequent model building. As a result, LEASE_99YEAR is excluded in the subsequent model building.

train <- subset(train, select= -c(UNIT_AGE))
test <- subset(test, select= -c(UNIT_AGE))

8.1 OLS Predictive Model

OLS multiple linear regression

train_nogeom <- train %>% st_drop_geometry()

price_mlr <- lm(resale_price ~ .,
                data=train_nogeom)
summary(price_mlr)

Call:
lm(formula = resale_price ~ ., data = train_nogeom)

Residuals:
    Min      1Q  Median      3Q     Max 
-324299  -54097   -2735   48780  715869 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -165691.0    19744.8  -8.392  < 2e-16 ***
floor_area_sqm              5530.3      126.6  43.675  < 2e-16 ***
remaining_lease             5588.6       79.5  70.295  < 2e-16 ***
storey_order               19992.5      352.5  56.723  < 2e-16 ***
PROX_ELDERCARE              1002.0     1180.4   0.849  0.39595    
PROX_MRT                  -25078.5     2016.2 -12.439  < 2e-16 ***
PROX_HAWKER               -32340.1     1382.1 -23.399  < 2e-16 ***
PROX_PARK                   7980.0     1752.5   4.554 5.32e-06 ***
PROX_SUPERMARKET           -8458.9     4651.4  -1.819  0.06900 .  
PROX_MALL                 -23894.5     2090.7 -11.429  < 2e-16 ***
WITHIN_350M_KINDERGARTEN    8418.9      668.8  12.588  < 2e-16 ***
WITHIN_350M_CHILDCARE      -3877.4      390.7  -9.924  < 2e-16 ***
WITHIN_350M_BUS              534.2      245.4   2.177  0.02952 *  
PROX_CBD                  -19925.7      239.9 -83.054  < 2e-16 ***
WITHIN_1KM_PRISCH         -14105.9      502.4 -28.078  < 2e-16 ***
PROX_GOOD_PRISCH           -1322.9      418.2  -3.163  0.00156 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 83390 on 14503 degrees of freedom
Multiple R-squared:  0.6746,    Adjusted R-squared:  0.6742 
F-statistic:  2004 on 15 and 14503 DF,  p-value: < 2.2e-16

multi-colinearity test

vif <- ols_vif_tol(price_mlr)
vif
                  Variables Tolerance      VIF
1            floor_area_sqm 0.5598668 1.786139
2           remaining_lease 0.5253518 1.903486
3              storey_order 0.8703427 1.148973
4            PROX_ELDERCARE 0.8084843 1.236882
5                  PROX_MRT 0.8038367 1.244034
6               PROX_HAWKER 0.7643724 1.308263
7                 PROX_PARK 0.8262207 1.210330
8          PROX_SUPERMARKET 0.8722162 1.146505
9                 PROX_MALL 0.7694176 1.299684
10 WITHIN_350M_KINDERGARTEN 0.9239515 1.082308
11    WITHIN_350M_CHILDCARE 0.7381697 1.354702
12          WITHIN_350M_BUS 0.8500367 1.176420
13                 PROX_CBD 0.5009505 1.996205
14        WITHIN_1KM_PRISCH 0.6576783 1.520500
15         PROX_GOOD_PRISCH 0.8104372 1.233902

We can observe that all vif values are well below 5, indicating there is no multi-colinearity between the independent variables.

8.1.1 Publication Quality Table

Using gt summary package, it produces a summary table for the OLS model output for HDB resale price data.

tbl_regression(price_mlr, intercept = TRUE) %>% 
  add_glance_source_note(
    label = list(sigma ~ "\U03C3"),
    include = c(r.squared, adj.r.squared, 
                AIC, statistic,
                p.value, sigma))
Characteristic Beta 95% CI1 p-value
(Intercept) -165,691 -204,393, -126,989 <0.001
floor_area_sqm 5,530 5,282, 5,778 <0.001
remaining_lease 5,589 5,433, 5,744 <0.001
storey_order 19,992 19,302, 20,683 <0.001
PROX_ELDERCARE 1,002 -1,312, 3,316 0.4
PROX_MRT -25,079 -29,030, -21,127 <0.001
PROX_HAWKER -32,340 -35,049, -29,631 <0.001
PROX_PARK 7,980 4,545, 11,415 <0.001
PROX_SUPERMARKET -8,459 -17,576, 658 0.069
PROX_MALL -23,894 -27,993, -19,796 <0.001
WITHIN_350M_KINDERGARTEN 8,419 7,108, 9,730 <0.001
WITHIN_350M_CHILDCARE -3,877 -4,643, -3,112 <0.001
WITHIN_350M_BUS 534 53, 1,015 0.030
PROX_CBD -19,926 -20,396, -19,455 <0.001
WITHIN_1KM_PRISCH -14,106 -15,091, -13,121 <0.001
PROX_GOOD_PRISCH -1,323 -2,143, -503 0.002
R² = 0.675; Adjusted R² = 0.674; AIC = 370,258; Statistic = 2,004; p-value = <0.001; σ = 83,388
1 CI = Confidence Interval

8.1.2 Check Non-linearity

ols_plot_resid_fit(price_mlr)

scattered around 0 line, dependent var - price and independent variables are linear

8.1.3 Check Multicolinearity

should be ok since dropped away columns

ols_vif_tol(price_mlr)
                  Variables Tolerance      VIF
1            floor_area_sqm 0.5598668 1.786139
2           remaining_lease 0.5253518 1.903486
3              storey_order 0.8703427 1.148973
4            PROX_ELDERCARE 0.8084843 1.236882
5                  PROX_MRT 0.8038367 1.244034
6               PROX_HAWKER 0.7643724 1.308263
7                 PROX_PARK 0.8262207 1.210330
8          PROX_SUPERMARKET 0.8722162 1.146505
9                 PROX_MALL 0.7694176 1.299684
10 WITHIN_350M_KINDERGARTEN 0.9239515 1.082308
11    WITHIN_350M_CHILDCARE 0.7381697 1.354702
12          WITHIN_350M_BUS 0.8500367 1.176420
13                 PROX_CBD 0.5009505 1.996205
14        WITHIN_1KM_PRISCH 0.6576783 1.520500
15         PROX_GOOD_PRISCH 0.8104372 1.233902

8.1.4 Check Normality Assumption

https://olsrr.rsquaredacademy.com/reference/ols_plot_resid_hist.html

ols_plot_resid_hist(price_mlr)

8.1.5 Check Spatial Autocorrelation

8.1.6 Predict Using Test Data

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/predict.lm

price_mlr_pred <- predict(price_mlr,
             data=test,
             se.fit = TRUE,
             interval = "prediction")
summary(price_mlr_pred)
               Length Class  Mode   
fit            43557  -none- numeric
se.fit         14519  -none- numeric
df                 1  -none- numeric
residual.scale     1  -none- numeric
price_mlr_pred$residual.scale
[1] 83388.49

compare residual from prediction model - 83388.49 vs residual from explanatory model - 83390

kiv

summary(price_mlr_pred$fit)
      fit               lwr               upr         
 Min.   : 108035   Min.   : -58559   Min.   : 274628  
 1st Qu.: 542859   1st Qu.: 379321   1st Qu.: 706467  
 Median : 612140   Median : 448599   Median : 775661  
 Mean   : 627297   Mean   : 463755   Mean   : 790839  
 3rd Qu.: 685101   3rd Qu.: 521575   3rd Qu.: 848660  
 Max.   :1170458   Max.   :1006807   Max.   :1334109  

8.2 GWR predictive

8.2.1 Adaptive Bandwidth

train_sp <- as_Spatial(train)
train_sp
class       : SpatialPointsDataFrame 
features    : 14519 
extent      : 6958.193, 42645.18, 28157.26, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 16
names       : resale_price, floor_area_sqm, remaining_lease, storey_order,       PROX_ELDERCARE,           PROX_MRT,        PROX_HAWKER,          PROX_PARK,     PROX_SUPERMARKET,            PROX_MALL, WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE, WITHIN_350M_BUS,         PROX_CBD, WITHIN_1KM_PRISCH, ... 
min values  :       350000,             99,           49.08,            1, 3.24852165967354e-07, 0.0229291315265009, 0.0494878683203785, 0.0600396043825887, 7.47984318619627e-07, 2.21189111471176e-12,                        0,                     0,               0,  1.6109563636452,                 0, ... 
max values  :      1418000,            167,           96.75,           17,     8.26597091415839,   2.12908590009577,   6.63382585101601,   6.00301683925507,     1.67131003223853,     4.00920093399656,                        8,                    20,              19, 23.2773731998479,                 9, ... 

determine optimal adaptive bw

bw_adaptive <- bw.gwr(resale_price ~.,
                  data=train_sp,
                  approach="CV",
                  kernel="gaussian",
                  adaptive=TRUE,
                  longlat=FALSE)
write_rds(bw_adaptive, "data/rds/bw_adaptive.rds") 
gwr_adaptive <- gwr.basic(formula = resale_price ~
                            floor_area_sqm + storey_order +
                            remaining_lease + PROX_CBD + 
                            PROX_ELDERCARE + PROX_HAWKER +
                            PROX_MRT + PROX_PARK + PROX_MALL + 
                            PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
                            WITHIN_350M_CHILDCARE + WITHIN_350M_BUS +
                            WITHIN_1KM_PRISCH + PROX_GOOD_PRISCH,
                          data=train_sp,
                          bw=bw_adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE,
                          longlat = FALSE)
write_rds(gwr_adaptive, "data/rds/gwr_adaptive.rds")
gwr_adaptive <- read_rds("data/rds/gwr_adaptive.rds")
gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2023-03-17 08:52:17 
   Call:
   gwr.basic(formula = resale_price ~ floor_area_sqm + storey_order + 
    remaining_lease + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + 
    PROX_MRT + PROX_PARK + PROX_MALL + PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN + 
    WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + WITHIN_1KM_PRISCH + 
    PROX_GOOD_PRISCH, data = train_sp, bw = bw_adaptive, kernel = "gaussian", 
    adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  resale_price
   Independent variables:  floor_area_sqm storey_order remaining_lease PROX_CBD PROX_ELDERCARE PROX_HAWKER PROX_MRT PROX_PARK PROX_MALL PROX_SUPERMARKET WITHIN_350M_KINDERGARTEN WITHIN_350M_CHILDCARE WITHIN_350M_BUS WITHIN_1KM_PRISCH PROX_GOOD_PRISCH
   Number of data points: 14519
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-324299  -54097   -2735   48780  715869 

   Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
   (Intercept)              -165691.0    19744.8  -8.392  < 2e-16 ***
   floor_area_sqm              5530.3      126.6  43.675  < 2e-16 ***
   storey_order               19992.5      352.5  56.723  < 2e-16 ***
   remaining_lease             5588.6       79.5  70.295  < 2e-16 ***
   PROX_CBD                  -19925.7      239.9 -83.054  < 2e-16 ***
   PROX_ELDERCARE              1002.0     1180.4   0.849  0.39595    
   PROX_HAWKER               -32340.1     1382.1 -23.399  < 2e-16 ***
   PROX_MRT                  -25078.5     2016.2 -12.439  < 2e-16 ***
   PROX_PARK                   7980.0     1752.5   4.554 5.32e-06 ***
   PROX_MALL                 -23894.5     2090.7 -11.429  < 2e-16 ***
   PROX_SUPERMARKET           -8458.9     4651.4  -1.819  0.06900 .  
   WITHIN_350M_KINDERGARTEN    8418.9      668.8  12.588  < 2e-16 ***
   WITHIN_350M_CHILDCARE      -3877.4      390.7  -9.924  < 2e-16 ***
   WITHIN_350M_BUS              534.2      245.4   2.177  0.02952 *  
   WITHIN_1KM_PRISCH         -14105.9      502.4 -28.078  < 2e-16 ***
   PROX_GOOD_PRISCH           -1322.9      418.2  -3.163  0.00156 ** 

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 83390 on 14503 degrees of freedom
   Multiple R-squared: 0.6746
   Adjusted R-squared: 0.6742 
   F-statistic:  2004 on 15 and 14503 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 1.008486e+14
   Sigma(hat): 83348.27
   AIC:  370258.4
   AICc:  370258.5
   BIC:  356031.2
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 69 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                                   Min.     1st Qu.      Median     3rd Qu.
   Intercept                -3.7946e+08 -9.6724e+05 -1.4552e+05  3.6217e+06
   floor_area_sqm           -1.6034e+05  1.1165e+03  3.2034e+03  5.8176e+03
   storey_order              3.3585e+03  1.1693e+04  1.4186e+04  1.7944e+04
   remaining_lease          -7.3105e+04 -2.1272e+03  4.0938e+03  6.6691e+03
   PROX_CBD                 -8.8341e+07 -2.8883e+05 -2.0156e+04  9.7493e+04
   PROX_ELDERCARE           -1.6279e+07 -4.7141e+04  1.3052e+04  9.1405e+04
   PROX_HAWKER              -6.7649e+06 -5.5437e+04  3.6489e+03  1.1450e+05
   PROX_MRT                 -7.8455e+06 -1.1892e+05 -4.9072e+04  5.2506e+04
   PROX_PARK                -3.5965e+07 -9.6154e+04 -1.0960e+04  6.3723e+04
   PROX_MALL                -2.8128e+07 -1.0545e+05 -2.3200e+04  5.9033e+04
   PROX_SUPERMARKET         -6.7266e+06 -9.3833e+04 -1.8712e+04  5.2123e+04
   WITHIN_350M_KINDERGARTEN -3.5303e+05 -9.4771e+03 -6.3515e+02  7.4898e+03
   WITHIN_350M_CHILDCARE    -9.7906e+04 -3.6575e+03  4.6350e+02  5.2338e+03
   WITHIN_350M_BUS          -4.0644e+04 -2.6840e+03  7.9049e+01  2.7765e+03
   WITHIN_1KM_PRISCH        -1.5107e+06 -6.7399e+03  6.6886e+01  7.2958e+03
   PROX_GOOD_PRISCH         -3.8384e+07 -1.0992e+05  1.7128e+04  2.4181e+05
                                 Max.
   Intercept                791294700
   floor_area_sqm              553894
   storey_order                 28264
   remaining_lease              12809
   PROX_CBD                  37653414
   PROX_ELDERCARE             9015002
   PROX_HAWKER               36079132
   PROX_MRT                  13087576
   PROX_PARK                 12916427
   PROX_MALL                 15349887
   PROX_SUPERMARKET           4282686
   WITHIN_350M_KINDERGARTEN    222801
   WITHIN_350M_CHILDCARE        62437
   WITHIN_350M_BUS              25374
   WITHIN_1KM_PRISCH           190039
   PROX_GOOD_PRISCH          97952173
   ************************Diagnostic information*************************
   Number of data points: 14519 
   Effective number of parameters (2trace(S) - trace(S'S)): 1454.963 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 13064.04 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 353806.5 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 352433.6 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 347927.4 
   Residual sum of squares: 2.732835e+13 
   R-square value:  0.9118132 
   Adjusted R-square value:  0.901991 

   ***********************************************************************
   Program stops at: 2023-03-17 08:56:48 

8.2.2 Visualise GWR output

resale.sf.adaptive <- st_as_sf(gwr_adaptive$SDF) %>%
  st_transform(crs=3414)
resale.sf.adaptive.svy21 <- st_transform(resale.sf.adaptive, 3414)
resale.sf.adaptive.svy21  
Simple feature collection with 14519 features and 54 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6958.193 ymin: 28157.26 xmax: 42645.18 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM
First 10 features:
     Intercept floor_area_sqm storey_order remaining_lease    PROX_CBD
1  -543922.315       5798.420     19921.33        9898.314   -858.6719
2  -455784.609       4884.513     19901.94        7705.634   2266.5729
3  -512073.474       4490.400     22228.63        7884.370  38683.0292
4   -25274.000       1843.624     24463.34        7990.419 -11223.3874
5  -626888.561       4668.305     21565.30        8089.424  42245.2023
6     1719.449       2941.478     23222.81        8571.491 -24507.6269
7  -557333.913       5302.043     19486.07        7505.329  11773.1036
8  -415412.903       4491.243     21907.71        7744.784  27745.8712
9  -337832.948       4516.634     20322.93        7771.730  -7123.5491
10 -910529.170       4195.171     21988.23        8439.962  71590.8110
   PROX_ELDERCARE PROX_HAWKER  PROX_MRT   PROX_PARK  PROX_MALL PROX_SUPERMARKET
1       -13072.20   -45495.08 -75797.72   76873.781  -93706.87         46355.87
2       -34378.12   134933.71 101221.84 -198916.246  -56710.20        -52827.67
3      -131693.52   -69535.73  51143.76 -153356.715 -114055.72        -19813.92
4        19579.72    77084.86  28637.83  -34017.472  -13903.67         20859.65
5      -172555.91    17625.06 -35577.00 -159535.592   39733.35        -47020.34
6        54207.70    10721.96 -43423.70   -6239.236  -49549.46         41126.84
7       -54903.67   171371.68 140386.22 -260819.748  -68210.93        -70784.58
8      -133868.63   -43581.03  56678.64 -164278.720 -103482.58        -47493.21
9       -37380.73   145880.12  82402.56 -211756.073  -42026.48        -55373.41
10     -168225.94   -84533.06 -86020.14  -80346.241   21808.80         84107.22
   WITHIN_350M_KINDERGARTEN WITHIN_350M_CHILDCARE WITHIN_350M_BUS
1                 -3702.856              4396.732      -3620.1390
2                 12899.984              7369.359       8876.7757
3                 10702.397              7523.916      -2211.8918
4                 18586.932             -1397.015       1121.0196
5                 15323.347             11072.893       2617.3135
6                 20746.176             -3234.304       3210.2936
7                  8886.546              8369.911       5682.4560
8                 12755.269              6366.276        419.2191
9                 12148.770              5232.924       8253.1675
10                -1405.150             13991.956      -2885.0542
   WITHIN_1KM_PRISCH PROX_GOOD_PRISCH      y     yhat   residual CV_Score
1         -19651.954       -14798.835 483000 520774.3  -37774.29        0
2         -22308.331        -4372.036 590000 678469.4  -88469.37        0
3         -52699.212        19801.413 629000 705712.4  -76712.40        0
4          -1117.886       -35827.116 670000 812096.3 -142096.27        0
5         -60158.652         5470.719 680000 732430.0  -52430.00        0
6          -4990.206       -47860.821 760000 847500.1  -87500.08        0
7         -23742.899        10531.956 768000 871399.0 -103398.99        0
8         -56144.137        20399.417 816800 847604.5  -30804.54        0
9         -18145.095         2667.099 820000 869514.4  -49514.42        0
10        -33729.036        22267.276 830000 931320.1 -101320.12        0
   Stud_residual Intercept_SE floor_area_sqm_SE storey_order_SE
1     -0.8742109    158542.20          844.4323        1457.736
2     -2.0152387    126071.62          559.7671        1303.619
3     -1.7477976    259802.12          946.9022        1739.495
4     -3.2541165    112750.46          619.1971        1592.735
5     -1.3833479    199621.63          721.5796        1436.238
6     -2.1093872    120678.76          550.8411        1347.255
7     -2.4053112    146846.13          631.9891        1401.630
8     -0.7453731    212127.87          806.7247        1605.774
9     -1.1171237     96618.68          523.2458        1182.313
10    -2.2650800    282399.65          927.5423        1583.970
   remaining_lease_SE PROX_CBD_SE PROX_ELDERCARE_SE PROX_HAWKER_SE PROX_MRT_SE
1            351.3113   12710.022          20457.23       29693.67    19004.44
2            318.8009   10002.732          15566.87       27512.51    21093.18
3            490.2410   24085.776          44560.68       67283.18    43678.88
4            381.0479    8815.157          18407.69       27153.58    20167.38
5            350.8117   16475.946          33591.89       46095.21    34232.89
6            318.0177    9926.553          17975.10       23531.16    22137.01
7            353.2831   11466.768          18170.32       31947.06    22842.01
8            420.3836   19064.802          34241.56       51584.02    35408.10
9            299.4600    6553.453          14554.28       22167.24    17944.33
10           444.1576   26387.322          50554.34       75087.50    44581.31
   PROX_PARK_SE PROX_MALL_SE PROX_SUPERMARKET_SE WITHIN_350M_KINDERGARTEN_SE
1      35935.86     32703.81            45755.20                    8479.763
2      31876.20     14970.65            30519.60                    3540.085
3      60862.84     29512.47            62214.35                   19820.525
4      39333.19     15078.13            25270.32                    3531.364
5      38123.20     38653.62            48824.18                    9332.557
6      34130.16     13448.84            23398.72                    3764.996
7      37662.12     18007.13            36581.97                    4217.026
8      47536.44     24376.65            44675.07                   14818.855
9      30100.09     13352.78            26761.09                    3203.425
10     62048.51     67936.07            76298.99                   19576.334
   WITHIN_350M_CHILDCARE_SE WITHIN_350M_BUS_SE WITHIN_1KM_PRISCH_SE
1                  4264.484           2371.936            10270.238
2                  2819.247           1885.707             5254.453
3                  5666.858           3444.424            17149.171
4                  2386.609           1704.021             4857.579
5                  4526.847           2575.145            12136.390
6                  2553.898           1622.145             5134.633
7                  3211.516           2288.978             6346.090
8                  4879.843           2957.016            13329.807
9                  2524.873           1693.603             4260.269
10                 6537.305           3417.585            19005.268
   PROX_GOOD_PRISCH_SE Intercept_TV floor_area_sqm_TV storey_order_TV
1            15694.359  -3.43077316          6.866649        13.66594
2            11375.964  -3.61528316          8.725974        15.26669
3            20174.625  -1.97101346          4.742200        12.77878
4            13683.786  -0.22415874          2.977442        15.35933
5            13881.186  -3.14038396          6.469563        15.01513
6            11969.928   0.01424815          5.339976        17.23714
7            12524.443  -3.79535979          8.389453        13.90243
8            15800.972  -1.95831363          5.567256        13.64309
9             9994.823  -3.49655945          8.631954        17.18914
10           22951.966  -3.22425741          4.522889        13.88172
   remaining_lease_TV PROX_CBD_TV PROX_ELDERCARE_TV PROX_HAWKER_TV PROX_MRT_TV
1            28.17534 -0.06755865        -0.6390015     -1.5321476   -3.988420
2            24.17068  0.22659538        -2.2084151      4.9044485    4.798794
3            16.08264  1.60605286        -2.9553749     -1.0334786    1.170904
4            20.96959 -1.27319197         1.0636710      2.8388472    1.420007
5            23.05917  2.56405316        -5.1368318      0.3823622   -1.039264
6            26.95287 -2.46889602         3.0157100      0.4556496   -1.961589
7            21.24452  1.02671508        -3.0216126      5.3642390    6.145966
8            18.42314  1.45534534        -3.9095370     -0.8448552    1.600725
9            25.95248 -1.08699173        -2.5683676      6.5808891    4.592123
10           19.00218  2.71307606        -3.3276259     -1.1257940   -1.929511
   PROX_PARK_TV PROX_MALL_TV PROX_SUPERMARKET_TV WITHIN_350M_KINDERGARTEN_TV
1     2.1391943   -2.8653201           1.0131280                 -0.43666972
2    -6.2402747   -3.7880914          -1.7309428                  3.64397544
3    -2.5197101   -3.8646617          -0.3184783                  0.53996538
4    -0.8648541   -0.9221085           0.8254601                  5.26338535
5    -4.1847374    1.0279334          -0.9630545                  1.64192364
6    -0.1828071   -3.6842937           1.7576531                  5.51027800
7    -6.9252545   -3.7879953          -1.9349579                  2.10730174
8    -3.4558481   -4.2451518          -1.0630808                  0.86074597
9    -7.0350651   -3.1473957          -2.0691763                  3.79243154
10   -1.2948939    0.3210195           1.1023372                 -0.07177801
   WITHIN_350M_CHILDCARE_TV WITHIN_350M_BUS_TV WITHIN_1KM_PRISCH_TV
1                 1.0310116         -1.5262378           -1.9134857
2                 2.6139454          4.7074005           -4.2456050
3                 1.3277050         -0.6421659           -3.0729889
4                -0.5853557          0.6578671           -0.2301324
5                 2.4460497          1.0163752           -4.9568819
6                -1.2664185          1.9790428           -0.9718720
7                 2.6062185          2.4825301           -3.7413429
8                 1.3046069          0.1417710           -4.2119241
9                 2.0725495          4.8731405           -4.2591426
10                2.1403248         -0.8441792           -1.7747203
   PROX_GOOD_PRISCH_TV  Local_R2                  geometry
1           -0.9429398 0.9475410 POINT (30820.82 39547.58)
2           -0.3843222 0.9241404 POINT (29412.84 38680.21)
3            0.9815009 0.9331871 POINT (29982.66 39489.71)
4           -2.6182165 0.9254545  POINT (28149.18 39053.5)
5            0.3941104 0.9317547 POINT (30053.55 38976.12)
6           -3.9984217 0.9056712  POINT (28526.9 39944.36)
7            0.8409121 0.9290346 POINT (29507.16 38473.85)
8            1.2910230 0.9270069  POINT (29872.38 39327.2)
9            0.2668480 0.9318115 POINT (29333.46 38563.53)
10           0.9701686 0.9440498 POINT (30264.97 39315.69)
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz_sf)+
  tm_polygons(alpha = 0.1) +
tm_shape(resale.sf.adaptive) +  
  tm_dots(col = "Local_R2",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(11,14))

Local R2: these values range between 0.0 and 1.0 and indicate how well the local regression model fits observed y values. Very low values indicate the local model is performing poorly. Mapping the Local R2 values to see where GWR predicts well and where it predicts poorly may provide clues about important variables that may be missing from the regression model.

8.3 GWR model

random forest model, using ranger package spatial ML introduce

8.3.1 Preparing Coordinates Data

coords_train <- st_coordinates(train)
coords_test <- st_coordinates(test)
coords_train <- write_rds(coords_train, "data/rds/coords_train.rds" )
coords_test <- write_rds(coords_test, "data/rds/coords_test.rds" )

https://rdrr.io/cran/SpatialML/man/grf.html

testing w bw=69 (neighbours) - from gwr adaptive bw computation for gwr model

set.seed(1234)
gwRF_adaptive <- grf(formula = resale_price ~
                            floor_area_sqm + storey_order +
                            remaining_lease + PROX_CBD + 
                            PROX_ELDERCARE + PROX_HAWKER +
                            PROX_MRT + PROX_PARK + PROX_MALL + 
                            PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
                            WITHIN_350M_CHILDCARE + WITHIN_350M_BUS +
                            WITHIN_1KM_PRISCH + PROX_GOOD_PRISCH,
                     dframe=train_nogeom, 
                     bw=69,
                     kernel="adaptive",
                     coords=coords_train,
                     ntree=30)
write_rds(gwRF_adaptive, "data/rds/gwRF_adaptive.rds")
gwRF_adaptive <- read_rds("data/rds/gwRF_adaptive.rds")

predict using GRF random forest on test data https://rdrr.io/cran/SpatialML/man/predict.grf.html

test <- cbind(test, coords_test) %>%
  st_drop_geometry()
gwRF_pred <- predict.grf(gwRF_adaptive, 
                           test, 
                           x.var.name="X",
                           y.var.name="Y", 
                           local.w=1,
                           global.w=0)
GRF_PRED <- write_rds(gwRF_pred, "data/rds/GRF_PRED.rds")
GRF_pred <- read_rds("data/rds/GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)

Interpreting results of random forest predictive model

test_data_p <- cbind(test, GRF_pred_df)
write_rds(test_data_p,"data/rds/test_data_p.rds")
rmse(test_data_p$resale_price, 
     test_data_p$GRF_pred)
[1] 56716.64
ggplot(data = test_data_p,
       aes(x = GRF_pred,
           y = resale_price)) +
  geom_point()

^ means a good fit of the model along the diagonal line (gwr accurate prediction?)

8.4 Comparison between OLS, GWR Model